欢迎关注“小丫画图”公众号,同名知识星球等你加入
小丫微信: epigenomics E-mail: figureya@126.com
作者:热心肠研究院 高春辉https://github.com/gaospecial
小丫编辑校验
用R从形式上复现原图。
出自https://academic.oup.com/jnci/article-lookup/doi/10.1093/jnci/djy156
Figure 1. Plot of all alterations detected by plasma next-generation sequencing (n=210). Size of circles represents number of patients identified with an alteration.
图的解析
例文用来展示基因上的 SNP/Indel/CNV 突变(多态性差异)。 例如,在TP53基因中发现了最多的多态性,包括 CNV 差异 SPLICE 8 个,单核苷酸位点差异 R273H 5个等。用点的大小表示差异,所以很容易发现常见的多态性差异。
稍作观察便可发现,本图是一个“圆环套圆环”的布局,中心处在中央,下一级的项目分别处在外环。图中只有二环,如果要扩展成五环,多显示几个层次应该也不错。
原文用的是在线工具
推测原文是用这个工具画的:FuncTreehttps://bioviz.tokyo/functree/,能画出类似的图,用来展示基因组数据的关系。
缺点:输入数据格式复杂非常复杂。需要针对每一个点做有针对性的设置。感兴趣的小伙伴去尝试一下吧~
如何用R实现
小丫发现tidytuesday2019-11-12的图跟原图很像,代码https://github.com/spren9er/tidytuesday/blob/master/tidytuesday_201946_cran_packages.r,输入数据https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-11-12/loc_cran_packages.csv。
参考这套代码,用 ggraph 来画这个图。为此,花了几天时间仔细研究了ggraph包,写下了一篇长文:一文读懂 ggraph 的使用
展示层级结构。例如:
注:
加载R包
library(clusterProfiler)
##
## clusterProfiler v3.16.1 For help: https://guangchuangyu.github.io/software/clusterProfiler
##
## If you use clusterProfiler in published research, please cite:
## Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.
##
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:stats':
##
## filter
library(GOplot)
## Loading required package: ggplot2
## Loading required package: ggdendro
## Loading required package: gridExtra
## Loading required package: RColorBrewer
library(tidyverse)
## ─ Attaching packages ──────────────────── tidyverse 1.3.0 ─
## ✓ tibble 3.0.4 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ✓ purrr 0.3.4
## ─ Conflicts ───────────────────── tidyverse_conflicts() ─
## x dplyr::combine() masks gridExtra::combine()
## x dplyr::filter() masks clusterProfiler::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::simplify() masks clusterProfiler::simplify()
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(ggraph)
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
Sys.setenv(LANGUAGE = "en") #显示英文报错信息
options(stringsAsFactors = FALSE) #禁止chr转成factor
加载自定义函数
source(file = "gather_graph_node.R")
source(file = "gather_graph_edge.R")
如果你的数据已经整理成very_easy_input.csv的格式,就可以跳过这步,直接进入“输入文件预处理”。
先用clusterProfiler做KEGG的GSEA,然后用例图的形式展示结果。
gsym.fc <- read.table("easy_input_rnk.txt", header = T)
dim(gsym.fc)
head(gsym.fc)
#把gene symbol转换为ENTREZ ID
#此处物种是人,其他物种的ID转换方法,请参考FigureYa52GOplot
gsym.id <- bitr(gsym.fc$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
#让基因名、ENTREZID、foldchange对应起来
gsym.fc.id <- merge(gsym.fc, gsym.id, by="SYMBOL", all=F)
#按照foldchange排序
gsym.fc.id.sorted <- gsym.fc.id[order(gsym.fc.id$logFC, decreasing = T),]
#获得ENTREZID、foldchange列表,做为GSEA的输入
id.fc <- gsym.fc.id.sorted$logFC
names(id.fc) <- gsym.fc.id.sorted$ENTREZID
#这一条语句就做完了KEGG的GSEA分析
kk <- gseKEGG(id.fc, organism = "hsa")
dim(kk)
#把ENTREZ ID转为gene symbol,便于查看通路里的基因
kk.gsym <- setReadable(kk, 'org.Hs.eg.db', #物种
'ENTREZID')
#可以用kk.gsym作为输入,用clusterProfiler画图
#用法看这里https://yulab-smu.github.io/clusterProfiler-book/chapter12.html
#gsym.fc.l <- gsym.fc$logFC
#names(gsym.fc.l) <- gsym.fc$SYMBOL
#cnetplot(sortkk, foldChange = gsym.fc.l, circular = TRUE)
#按照enrichment score从高到低排序,取前5(up)和后5(down)
#sortkk <- kk.gsym[order(kk.gsym@result$enrichmentScore, decreasing = T),][c(1:5, (nrow(kk.gsym)-5):nrow(kk.gsym)),]
#这里提取感兴趣的3个通路,数量太多拥挤的话不好看基因名
sortkk <- kk.gsym[kk.gsym@result$Description %like% "DNA" |
kk.gsym@result$Description %like% "cycle" |
kk.gsym@result$Description %like% "p53",]
#把富集分析结果整理为GOplot所需的格式
go <- data.frame(Category = "KEGG",
ID = sortkk$ID,
Term = sortkk$Description,
Genes = gsub("/", ", ", sortkk$core_enrichment),
adj_pval = sortkk$p.adjust)
#基因变化倍数
genelist <- data.frame(ID = gsym.fc.id$SYMBOL, logFC = gsym.fc.id$logFC)
#把富集分析和倍数整合在一起
circ <- circle_dat(go, genelist)
head(circ)
#可以用circ作为输入,用GOplot画图
#用法看这里https://wencke.github.io/
#GOCircle(circ)
#保存到文件
write.csv(circ[,c(3,5,6)],"very_easy_input.csv", quote = F, row.names = F)
very_easy_input.csv,这里以上面的富集分析结果为例,展示通路和通路里的基因变化倍数FC。三列依次是通路-基因-倍数,可以自由替换成“应用场景”里其他需要展示的信息。
gene_special.txt,要突出显示的基因。第一列是基因名,第二列是类型(例如基因家族信息)。
df <- read.csv("very_easy_input.csv")
head(df)
## term genes logFC
## 1 DNA replication MCM5 2.010821
## 2 DNA replication MCM2 1.956162
## 3 DNA replication MCM6 1.439526
## 4 DNA replication MCM4 1.422334
## 5 DNA replication FEN1 1.385611
## 6 DNA replication RFC4 1.266485
geneSpecial <- read.table("gene_special.txt", header = T)
geneCol <- geneSpecial$Type
names(geneCol) <- geneSpecial$Gene
geneCol
## MCM6 MCM2 CD6 CDK4 CHEK2 CHEK1
## "type1" "type1" "type2" "type2" "type3" "type3"
图由两个部分组成,节点(node)和边(edge)。
要从上面的数据框中采集节点和边的信息。
为此,我分别写了两个函数:gather_graph_node() 和 gather_graph_edge() 来完成这一个任务(前面已加载)。
这两个函数的参数设置借鉴了 treemap() 的实现方式。
df:一个数据框index:一个索引项(分组项)value:要采集的数值为了确保 node.name 的唯一性,在图中使用了长名,而把原有的名字放在 node.short_name 中去了。
node.level 则用来指示节点应该处于第几个圆环。
节点的属性统一以 node 作为前缀,而边的属性则以 edge 作为前缀。
nodes <- gather_graph_node(df, index = c("term", "genes"), value = "logFC", root="all")
## `summarise()` ungrouping output (override with `.groups` argument)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(dots)` instead of `dots` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## `summarise()` regrouping output by 'term' (override with `.groups` argument)
edges <- gather_graph_edge(df, index = c("term", "genes"), root = "all")
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(index)` instead of `index` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## `summarise()` ungrouping output (override with `.groups` argument)
nodes <- nodes %>% mutate_at(c("node.level","node.branch"),as.character)
head(nodes, 10)
## node.name node.size node.level node.count node.short_name
## 1 all 115.692338 all 1 all
## 2 Cell cycle 70.226424 term 46 Cell cycle
## 3 DNA replication 21.413204 term 21 DNA replication
## 4 p53 signaling pathway 24.052710 term 16 p53 signaling pathway
## 5 Cell cycle/BUB1 1.185457 genes 1 BUB1
## 6 Cell cycle/BUB1B 1.809404 genes 1 BUB1B
## 7 Cell cycle/CCNA2 2.446934 genes 1 CCNA2
## 8 Cell cycle/CCNB1 2.046811 genes 1 CCNB1
## 9 Cell cycle/CCNB2 3.125239 genes 1 CCNB2
## 10 Cell cycle/CCNE1 1.609160 genes 1 CCNE1
## node.branch
## 1 all
## 2 Cell cycle
## 3 DNA replication
## 4 p53 signaling pathway
## 5 Cell cycle
## 6 Cell cycle
## 7 Cell cycle
## 8 Cell cycle
## 9 Cell cycle
## 10 Cell cycle
head(edges, 10)
## # A tibble: 10 x 2
## from to
## <chr> <chr>
## 1 all Cell cycle
## 2 all DNA replication
## 3 all p53 signaling pathway
## 4 DNA replication DNA replication/MCM5
## 5 DNA replication DNA replication/MCM2
## 6 DNA replication DNA replication/MCM6
## 7 DNA replication DNA replication/MCM4
## 8 DNA replication DNA replication/FEN1
## 9 DNA replication DNA replication/RFC4
## 10 DNA replication DNA replication/PCNA
# 把要突出显示的基因类型信息加到nodes里
nodes$color <- "normal"
nodes[nodes$node.short_name %in% geneSpecial$Gene,]$color <- geneCol[nodes[nodes$node.short_name %in% geneSpecial$Gene,]$node.short_name]
nodes[nodes$node.short_name %in% geneSpecial$Gene,]
## node.name node.size node.level node.count node.short_name
## 20 Cell cycle/CDK4 0.7449589 genes 1 CDK4
## 23 Cell cycle/CHEK1 2.2365719 genes 1 CHEK1
## 24 Cell cycle/CHEK2 0.6951341 genes 1 CHEK2
## 32 Cell cycle/MCM2 1.9561624 genes 1 MCM2
## 36 Cell cycle/MCM6 1.4395262 genes 1 MCM6
## 53 DNA replication/MCM2 1.9561624 genes 1 MCM2
## 57 DNA replication/MCM6 1.4395262 genes 1 MCM6
## 79 p53 signaling pathway/CDK4 0.7449589 genes 1 CDK4
## 81 p53 signaling pathway/CHEK1 2.2365719 genes 1 CHEK1
## 82 p53 signaling pathway/CHEK2 0.6951341 genes 1 CHEK2
## node.branch color
## 20 Cell cycle type2
## 23 Cell cycle type3
## 24 Cell cycle type3
## 32 Cell cycle type1
## 36 Cell cycle type1
## 53 DNA replication type1
## 57 DNA replication type1
## 79 p53 signaling pathway type2
## 81 p53 signaling pathway type3
## 82 p53 signaling pathway type3
nodes$color <- factor(nodes$color, levels = unique(nodes$color))
# 有了节点和边的数据,使用 `tbl_graph()` 便可以得到一个图。
graph <- tbl_graph(nodes, edges)
# 用 `ggraph` 出图
gc <- ggraph(graph, layout = 'dendrogram', circular = TRUE) +
# 使用 filter 参数去掉 root(前面设置为"all")节点及与其相连的边
geom_edge_diagonal(aes(color = node1.node.branch,
filter=node1.node.level!="all"),
alpha = 1/3,edge_width=1) +
geom_node_point(aes(size = node.size,
color = node.branch,
filter=node.level!="all"), alpha = 1/3) +
scale_size(range = c(0.5,80)) + #做均一化处理,让点的大小介于range之间
theme(legend.position = "none") + #不画图例
# 点和边的配色
# 如果要改变边的配色,需要同时给边和点变色,否则会对应不上
scale_edge_color_brewer(palette = "Set1") + #用?scale_color_brewer查看更多配色方案
scale_color_brewer(palette = "Set1") +
# 添加周围注释文字,此处是基因名gene
geom_node_text(
aes(
x = 1.048 * x, #控制字跟点的距离
y = 1.048 * y, #控制字跟点的距离
label = node.short_name,
angle = -((-node_angle(x, y) + 90) %% 180) + 90,
filter = leaf,
color = node.branch
),
size = 6, hjust = 'outward') +
# 添加内环文字,此处是通路名term
geom_node_text(
aes(label=node.short_name,
filter = !leaf & (node.level != "all"),
color = node.branch),
fontface="bold",
size=6,
family="sans"
) +
theme(panel.background = element_rect(fill = NA)) +
coord_cartesian(xlim=c(-1.3,1.3),ylim = c(-1.3,1.3)) #扩大坐标系
gc
ggsave("ccgraph_color.pdf", width = 14, height = 14)
在上面的图形中,线条的颜色由 geom_edge_diagonal(aes(color = node1.node.branch)) 指定。
node1.node.branch 指的是出发点(node1)的 node.branch 属性。如果要改变线条颜色,可以修改 nodes 表,添加一个属性(如 color) ,然后在 geom_edge_diagonal() 中将其映射到 color 上即可。
gc1 <- ggraph(graph, layout = 'dendrogram', circular = TRUE) +
#画连线
geom_edge_diagonal(aes(color = node2.color,
filter=node1.node.level!="all"),
alpha = 0.5, #透明度
edge_width=2.5) + #连线的粗细
scale_edge_color_manual(values = c("#61C3ED","red","purple","darkgreen")) + #自定义颜色
#画点
geom_node_point(aes(size = node.size,
filter=node.level!="all"),
#alpha = 1/3,
color = "#61C3ED") + #统一为淡蓝色
scale_size(range = c(0.5,80)) + #做均一化处理,让点的大小介于range之间
theme(legend.position = "none") + #不画图例
# 添加周围注释文字,此处是基因名gene
geom_node_text(
aes(
x = 1.05 * x, #控制字跟点的距离
y = 1.05 * y, #控制字跟点的距离
label = node.short_name,
angle = -((-node_angle(x, y) + 90) %% 180) + 90,
filter = leaf
),
color="black", #统一为黑色字
size = 6, hjust = 'outward') +
# 添加内环文字,此处是通路名term
geom_node_text(
aes(label=node.short_name,
filter = !leaf & (node.level != "all")
),
color="black", #统一为黑色字
fontface="bold",
size=6,
family="sans"
) +
theme(panel.background = element_rect(fill = NA)) + #背景透明色
coord_cartesian(xlim=c(-1.3,1.3),ylim = c(-1.3,1.3)) #扩大坐标系
gc1
保存到pdf文件,是矢量图,可以用Illustrator等软件编辑图形、文字和背景
ggsave("ccgraph.pdf",width = 14,height = 14)
这套代码不仅可以画两层的图,理论上支持更多层(要不然怎么叫“圆环套圆环”呢?)。
下面是一个例子,这里使用了常见的微生物组数据集(这是一个随机生成的 OTU 表)。
#随机生成一套数据
n <- 1000
microbiome <- data.frame(
otu = paste("OTU",1:n,sep="_"),
phylum = sample(paste("phylum",1:5,sep="_"),n,replace = T),
class = sample(paste("class",6:30,sep="_"),n,replace=T),
order = sample(paste("order",31:80,sep="_"),n,replace = T),
value = runif(n,min=1,max=1000)
)
head(microbiome)
#保存到文件,便于小白套用格式
write.csv(microbiome, "microbiome.csv", quote = F, row.names = F)
加载输入数据
microbiome.csv,想画几层就给几+1列。这里前4列对应4层,最后一列是最底层节点对应的数值。
microbiome <- read.csv("microbiome.csv", header = T)
index_micro <- c("phylum","class","order") #除了最低层以外的列名
nodes_micro <- gather_graph_node(microbiome,index=index_micro,
root = "bacteria") #root名字自己随便取
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` regrouping output by 'phylum' (override with `.groups` argument)
## `summarise()` regrouping output by 'phylum', 'class' (override with `.groups` argument)
edges_micro <- gather_graph_edge(microbiome,index=index_micro,root = "bacteria")
## `summarise()` ungrouping output (override with `.groups` argument)
画图
graph_micro <- tbl_graph(nodes_micro,edges_micro)
ggraph(graph_micro,layout = "dendrogram",circular=T) +
geom_edge_diagonal(aes(color = node1.node.branch,filter=node1.node.level!="bacteria", alpha = node1.node.level),edge_width=1) +
geom_node_point(aes(size = node.size, color = node.branch,filter=node.level!="bacteria"), alpha = 1/3) +
scale_size(range = c(0.5,80)) + #做均一化处理,让点的大小介于range之间
theme(legend.position = "none")+ #不画图例
scale_edge_color_brewer(palette = "Set1") + #用?scale_color_brewer查看更多配色方案
scale_color_brewer(palette = "Set1") +
# 添加周围注释文字,此处是基因名gene
geom_node_text(
aes(
x = 1.058 * x, #控制字跟点的距离
y = 1.058 * y, #控制字跟点的距离
label = node.short_name,
angle = -((-node_angle(x, y) + 90) %% 180) + 90,
filter = leaf,
color = node.branch
),
size = 1, hjust = 'outward') +
# 添加内环文字,此处是通路名term
geom_node_text(
aes(label=node.short_name,
filter = !leaf & (node.level == "phylum"),
color = node.branch),
fontface="bold",
size=6,
family="sans"
) +
theme(panel.background = element_rect(fill = NA)) +
coord_cartesian(xlim=c(-1.3,1.3),ylim = c(-1.3,1.3)) #扩大坐标系
ggsave("ccgraph_microbiome.pdf", width = 14, height = 14)
搞大你的点https://mp.weixin.qq.com/s/JNIncz3W-59yjGk2ibJWUw
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] zh_CN.UTF-8/zh_CN.UTF-8/zh_CN.UTF-8/C/zh_CN.UTF-8/zh_CN.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidygraph_1.2.0 ggraph_2.0.4 data.table_1.13.4
## [4] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
## [7] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
## [10] tibble_3.0.4 tidyverse_1.3.0 GOplot_1.0.2
## [13] RColorBrewer_1.1-2 gridExtra_2.3 ggdendro_0.1.22
## [16] ggplot2_3.3.2 clusterProfiler_3.16.1
##
## loaded via a namespace (and not attached):
## [1] fgsea_1.14.0 colorspace_2.0-0 ellipsis_0.3.1
## [4] ggridges_0.5.2 qvalue_2.20.0 fs_1.5.0
## [7] rstudioapi_0.13 farver_2.0.3 urltools_1.7.3
## [10] graphlayouts_0.7.1 ggrepel_0.9.0 bit64_4.0.5
## [13] AnnotationDbi_1.50.3 fansi_0.4.1 scatterpie_0.1.5
## [16] lubridate_1.7.9.2 xml2_1.3.2 splines_4.0.2
## [19] GOSemSim_2.14.2 knitr_1.30 polyclip_1.10-0
## [22] jsonlite_1.7.2 broom_0.7.3 GO.db_3.11.4
## [25] dbplyr_2.0.0 ggforce_0.3.2 BiocManager_1.30.10
## [28] compiler_4.0.2 httr_1.4.2 rvcheck_0.1.8
## [31] backports_1.2.1 assertthat_0.2.1 Matrix_1.3-0
## [34] cli_2.2.0 tweenr_1.0.1 htmltools_0.5.0
## [37] prettyunits_1.1.1 tools_4.0.2 igraph_1.2.6
## [40] gtable_0.3.0 glue_1.4.2 reshape2_1.4.4
## [43] DO.db_2.9 fastmatch_1.1-0 Rcpp_1.0.5
## [46] enrichplot_1.8.1 Biobase_2.48.0 cellranger_1.1.0
## [49] vctrs_0.3.6 xfun_0.19 rvest_0.3.6
## [52] lifecycle_0.2.0 DOSE_3.14.0 europepmc_0.4
## [55] MASS_7.3-53 scales_1.1.1 hms_0.5.3
## [58] parallel_4.0.2 yaml_2.2.1 memoise_1.1.0
## [61] downloader_0.4 triebeard_0.3.0 stringi_1.5.3
## [64] RSQLite_2.2.1 S4Vectors_0.26.1 BiocGenerics_0.34.0
## [67] BiocParallel_1.22.0 rlang_0.4.9 pkgconfig_2.0.3
## [70] evaluate_0.14 lattice_0.20-41 labeling_0.4.2
## [73] cowplot_1.1.0 bit_4.0.4 tidyselect_1.1.0
## [76] plyr_1.8.6 magrittr_2.0.1 R6_2.5.0
## [79] IRanges_2.22.2 generics_0.1.0 DBI_1.1.0
## [82] pillar_1.4.7 haven_2.3.1 withr_2.3.0
## [85] modelr_0.1.8 crayon_1.3.4 utf8_1.1.4
## [88] rmarkdown_2.6 viridis_0.5.1 progress_1.2.2
## [91] grid_4.0.2 readxl_1.3.1 blob_1.2.1
## [94] reprex_0.3.0 digest_0.6.27 gridGraphics_0.5-1
## [97] stats4_4.0.2 munsell_0.5.0 viridisLite_0.3.0
## [100] ggplotify_0.0.5